contact: rpauloo@ucdavis.edu
Twitter
Website
This version of the analysis doesn’t contain the source code used in the analysis. Please contact me if that’s something you’re interested in. I apologize in advance for paragraphs that directly reference the not-included code!
In California, whenever a well is drilled, a log of what the driller encounters is recorded and submitted to the state of California. Until late 2017, all well logs in California were privately held data. With the passage of groundwater sustainability legislation in California, this data was released to the public. This data gives us insight into the mysterious groundwater aquifers of California, where surprisingly little data exists for this vital public resource that provides anywhere form 40-60% of the water consumed annually in the state.
The California DWR Online Well Completion Report Database contains ~1,000,000 digitized well logs. What information can be mined from these logs, and what would it tell us about aquifers in California?
A water modeler in a basin might be interested in what hydrogeologic data is available, and how they might use that data to parameterize a groundwater flow or contaminant transport model. Or imagine that a state or federal agency with limited funds to disburse needed to predict a community’s drought vulnerability, or characterize at-risk communities, to mitigate future water shortages. Which are the spatial and socioeconomic patterns in well completion, and are there disparities in who is affected by drought? Lastly, consider a local agency that wants to digitize a portion of well completion reports in their subbasin. Which well reports in their basin offer high quality data, and which should they select?
The purpose of this analysis (mirrored in the table of contents) is to:
To get started, first load up some packages for the analysis and visualization.
Bring in the raw data and inspect each column. We have close to 1,000,000 observations of wells logs, and 46 variables per well log.
## Observations: 943,469
## Variables: 46
## $ WCRNumber <chr> "WCR2016-003980", "WCR2016-004287...
## $ LegacyLogNumber <chr> NA, "e0306175", "e0303164", "e030...
## $ RegionOffice <chr> "DWR Northern Region Office", "DW...
## $ CountyName <chr> "Colusa", "Colusa", "Colusa", "Co...
## $ LocalPermitAgency <chr> "Colusa County Environmental Heal...
## $ PermitDate <chr> "04/12/2016", "1/29/2016", NA, "3...
## $ PermitNumber <chr> "WP 896", "WP860", NA, "WO 088", ...
## $ OwnerAssignedWellNumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ WellLocation <chr> NA, "Ohm RD", "Dry Slough RD", "S...
## $ City <chr> NA, "Arbuckle", "Grimes", "Arbuck...
## $ PlannedUseFormerUse <chr> "Water Supply Irrigation - Agricu...
## $ DrillerName <chr> "SULLIVAN DRILLING INC", "SULLIVA...
## $ DrillerLicenseNumber <chr> "656504", "656504", "656504", "65...
## $ RecordType <chr> "WellCompletion/New/Production or...
## $ DecimalLatitude <dbl> 39.06429, 39.06410, 39.10716, 39....
## $ DecimalLongitude <dbl> -122.0507, -122.0321, -121.9406, ...
## $ MethodofDeterminationLL <chr> "Derived from TRS", "Derived from...
## $ LLAccuracy <chr> "Centroid of Section", "Centroid ...
## $ HorizontalDatum <chr> "WGS84", "WGS84", "WGS84", "WGS84...
## $ GroundSurfaceElevation <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ElevationAccuracy <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ElevationDeterminationMethod <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ VerticalDatum <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Township <chr> "14N", "14N", "15N", "14N", "14N"...
## $ Range <chr> "02W", "02W", "01W", "02W", "02W"...
## $ Section <chr> "14", "13", "35", "22", "34", "31...
## $ BaselineMeridian <chr> "Mount Diablo", "Mount Diablo", "...
## $ APN <chr> "018.180.052", "018-180-035", "01...
## $ DateWorkEnded <chr> "5/24/2016 0:00:00", "4/7/2016 0:...
## $ WorkflowStatus <chr> "Completeness Review - Complete -...
## $ ReceivedDate <chr> "6/16/2016 0:00:00", "5/12/2016 0...
## $ TotalDrillDepth <int> 1000, 1000, 820, 500, 910, 1200, ...
## $ TotalCompletedDepth <dbl> 860, 960, 620, 295, 700, 1160, 18...
## $ TopOfPerforatedInterval <dbl> NA, 280, 220, 255, 240, 420, 80, ...
## $ BottomofPerforatedInterval <dbl> NA, 940, 620, 295, 680, 1130, 160...
## $ CasingDiameter <dbl> NA, 16.000, 18.000, 8.000, 17.400...
## $ DrillingMethod <chr> "Reverse Circulation", "Reverse C...
## $ Fluid <chr> "Bentonite", "Bentonite", "Benton...
## $ StaticWaterLevel <dbl> 77, 48, 24, 86, 120, 229, 4, NA, ...
## $ TotalDrawDown <int> NA, NA, NA, NA, NA, 135, 53, NA, ...
## $ TestType <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ PumpTestLength <int> NA, NA, NA, NA, NA, 8, 4, 2, NA, ...
## $ WellYield <dbl> NA, NA, NA, NA, NA, NA, 60.0, NA,...
## $ WellYieldUnitofMeasure <chr> NA, NA, NA, NA, NA, NA, "GPM", NA...
## $ OtherObservations <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ MTRS <chr> "M14N02W14", "M14N02W13", "M15N01...
There’s a wealth of data here, but how much of it is missing? The plot below shows data that is present in purple, and missing data in yellow Each horizontal bar represents one variable in the dataset.
To summarize a few key points:
PlannedUseFormerUse) are missing. These values might be computationally imputated based on existing variables, and variables that may be engineered such as:The plot above shows that about half of our dataset are missing values. However, the glass is half full. The data that is present will be used to answer some interesting questions.
I’m going to start by renaming some variables that will be used a lot in this analysis to make typing easier.
About 5% percent of wells are missing a latitude or longitude.
## [1] 0.05356509
We may be tempeted to throw these observations out because they’re a small fraction of the overall data, but let’s see if we can sensibly impute where wells “should” be from their Township, Range and Section.
For all but 1 of our missing lat/lon, we have a township, range, and section! This is good news, but we’ll see later that Township, Range and Section are pretty messy data.
## [1] 1
Before we impute lat/lon, let’s clean the rest of our lat/lon data, so we only need to do one round of lat/lon imputation.
We’re in California, so all of our latitudes and longitudes should fall within a bounding box:
-124.4096 < Longitude < -114.1308 32.5343 < Latitude < 42.0095
Does our data fall within this bounding box? 100% of our latitudes line up with reality, and more than 99% of our longitudes do so as well. That’s good.
## [1] 99.98253
## [1] 100
Of the < 1% of longitudes that fall outside of California, 141 are recorded as positive. We will multiply these values by -1.
## # A tibble: 149 x 2
## lat lon
## <dbl> <dbl>
## 1 40.5 124.
## 2 40.6 124.
## 3 40.6 124.
## 4 40.6 124.
## 5 35.3 119.
## 6 36.9 120.
## 7 38.1 121.
## 8 38.1 121.
## 9 36.0 121.
## 10 33.6 117.
## # ... with 139 more rows
There are 7 very large negative longitudes. These can be fixed by cross-referencing the township, range and section of these points and identifying the most sensible Longitude. This error is likely to have occured when someone was manually inputting data and entered an extra number. Moreover, there’s one extra large latitude that is clearly a mistakenly placed decimal. It’s only one record, so it didn’t show up earlier in the rounded percentage of longitudes within out bounding box. We’ll correct that too.
## # A tibble: 7 x 2
## lat lon
## <dbl> <dbl>
## 1 38.0 -1233.
## 2 37.9 -1322.
## 3 39.8 -1234.
## 4 38.2 -3122.
## 5 38.9 -1321.
## 6 37.3 -1211.
## 7 38.8 -1523.
## # A tibble: 1 x 2
## lat lon
## <dbl> <dbl>
## 1 388. -122.
To fix the 7 big longitudes, first grab the Township, Range and Section for observations with big longitudes.
There are some errors in the township, range, and section data (which is outside of the scope of this analysis), but the very large negative longitudes belong to townships, ranges and sections that are correct. Comparing the median longitudes of all wells in the missing township, range and section, we can see where input errors took place and sensibly impute what longitude these observations should be.
## # A tibble: 197 x 4
## # Groups: township, range [?]
## township range section median_lon
## <chr> <chr> <chr> <dbl>
## 1 01S 01E 03 -122.
## 2 01S 01E 04 -122.
## 3 01S 01E 20 -122.
## 4 01S 01E 21 -117.
## 5 01S 04W 03 -122.
## 6 01S 04W 04 -117.
## 7 01S 04W 11 -117.
## 8 01S 04W 19 -117.
## 9 01S 04W 20 -117.
## 10 01S 04W 21 -117.
## # ... with 187 more rows
How were these coordinates determined, and how accurate are those predicitons? Nearly all coordinates in this database were derived from the township, range and section, and furthermore are accurate to the centroid of a section. Sections in the Public Land Survey System (Rectangular Survey System) are defined as 1 square mile squares. That means that the great majority of wells in this dataset have points at the center of section centroids, and assuming they are all correctly encoded to the right section, the true location of the well should be within .50 - .71 miles from the given location.
We will assume that these spatial coordinates are close enough to the true values that we can aggregate points into larger subbains, and perform spatial analysis at that aggregated scale, and we will not do any sort of point-pattern analysis at the level of the point.
After cleaning, all of our data nicely plots in California, which is what we expect, and we didn’t have to throw any data out. Some points plot on islands off the California coast; these will be omitted when the analysis is focused on groundwater basins.
We’re interested in groundwater management, so using our clean coordinates we will group wells into subbasins and focus all further analysis on this subset of the data.
To aggregrate our well points into Bulletin 118 ploygons, we’ll perform an intersection. For the math to work, we’ve made sure that our polygons and pts have been transformed into the same projection, or coordinate reference system (crs). About 300,000 wells fall outside of the Bulletin 118 subbasins, and will be omitted from this analysis.
Looks like our intersection worked. Compared to the entire well database, wells are now isolated to the Bulletin 118 groundwater subregions. In the process we trimmed our dataset of wells by about 1/3.
For each variable, duplicate entires that encoded differently should be aggregated. A lemmatization of all well types will increase the utility of the data for statistical models and a deeper understanding of the system.
Here we will clean the variable PlannedUseFormerUse, or what I’ve renamed as type for all data, including data outside of Bulletin 118 subbasins. This varible has 245 unique levels but many of these have the same meaning. Take for instance the first 10 unique entries:
## [1] "Water Supply Irrigation - Agriculture"
## [2] "Water Supply Industrial"
## [3] "Water Supply Domestic"
## [4] "Other Unknown"
## [5] "Other Unknown"
## [6] NA
## [7] "Water Supply Irrigation - Agricultural"
## [8] "Cathodic Protection"
## [9] "Unknown"
## [10] "Other Not Specified"
This list can be grouped into many fewer categories than 245.
The most common well types provides insight into how to begin grouping wells. Here are the top 10 most common well types.
## # A tibble: 10 x 2
## type n
## <chr> <int>
## 1 water supply domestic 355502
## 2 <NA> 200341
## 3 monitoring 126173
## 4 other unused 61280
## 5 water supply irrigation - agriculture 36023
## 6 water supply irrigation - agricultural 31247
## 7 other unknown 21365
## 8 water supply public 14722
## 9 water supply irrigation agricultural 14631
## 10 test well 12011
We can use words indicative of a group to index entries containing those words, and programatically rename them. All categories containing more than 1,000 well types (~0.1% of the data) were perserved as unique categories.
At this point, after covering the top 13 well types, less than 1% of wells remain classified as some form of “other”. Let’s lump them into one category called “other”.
## [1] 0.003260308
Knowing the date the well was completed also gives us valuable information. Let’s create those columns, and save the cleaned data.
## [1] 0.8051181
After grouping well types, we can look at wells in Bulletin 118 subbasins and see that most wells are categorized.
Now we can save this clean data frame for futre anaylsis.
A future analysis might impute missing well types based on:
Outside datasets that would help in imputing these values include:
It’s important to remember that the wells in this dataset only represent reported wells. The number of unreported wells not found in this database are imaginably in the 10s to 100s of thousands.
A simple join with a digital elevation model will add much value to this dataset for relatively little work. This will go some way in resolving where exactly the top and bottom of the perforated interval rests. Any measurement in the z direction is subject sources of uncertainy, including:
For these reasons, in this report, I’ll focus on the thickness of the perforated interval, which is less subject to these uncertainties because it’s simply the difference between the top and bottom of the screen interval.
Where missing values are present (i.e. - PlannedUseFutureUse, well casing diameter, well yield, total drawdown) imputation via sensible value selection or statistical models may offer a way to improve data coverage in areas that lack adequate data. Imputation is a method to complete missing data with statistical models.
Now that we’ve done some data cleaning, let’s ask some interesting questions of the Bulletin 118 wells from 1900 onward, and see what the data might tell us.
We would expect that the number of wells constructed has grown over time to meet the water demands of a growing Californian population. It would also make sense that more wells are drilled during or after droughts (to see what this report considers a drought, please refer to the Appendix). In fact, the 3 driest years on California record occur in 1977, 1991, and 2015, and the data show an increase in wells drilled during those years.
In 1977 and 1991, two of the top three driest years on California record, we see spikes in the number of wells drilled. Following the 2001-2002 drought, from 2003-2005 there’s another spike in well construction. Interestingly, the most recent 2012-2016 drought doesn’t show such a marked spike in well construction as previous events.
Diffenbaugh, Swain, and Touma (2015) show that the simulated probability of dry years has been increasing in California, beginning around ~1980. Interestingly, the number of wells drilled per year hasn’t fallen below levels seen since ~1980.
What are the major well types that aren’t missing? Defined by counts alone, those are domestic, agricultural, monitoring and unused wells.
What about minor well types? Defined again by count, these include Public, injection and remediation wells.
Let’s first take a look at Agricultural wells. Almost all of these subbasins are the in the southern Central Valley’s Tulare Basin.
And now we look at domestic wells. Again we see that with the exception of one subbasin, more wells are completed in southern Central Valley subbasins than any other region.
Taking a look at the map, we see that most of these wells are drilled in California’s Central Valley, with an very large number of wells drilled in the Southern Tulare Basin.
More domestic wells are completed than any other type of well. Domestic wells tend to be screened closer to land surface, and run the risk of drying out before deeper agricultural and public supply wells, as we saw in the most recent drought.
QUESTION: Is the household income (from the UC Census) in the areas with the highest domestic well completion rates lower than California’s median household income?
We will use 2016 tract-level Census data to explore this relationship, and calculate percent difference from median CA household income for each tract. From census.gov:
Census tracts are small, relatively permanent statistical subdivisions of a county or county equivalent and generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people.
Percent difference is calculated as:
\[PD_{i} = \frac{a_{i} - b}{\frac{1}{2}(a_{i} + b)} \cdot 100\]
Where a is the household income of tract i, and b is the median household income for California. The median household income is used as a benchmark for percent difference because the distribution of incomes is left-skewed by a small handful of high income zones, shown on the first tab.
Lastly, we will calculate the weighted percent difference per tract, summed over all census tracts to answer the question above: is the household income (from the UC Census) in the areas with the highest domestic well completion rates lower than California’s median household income? We use tracts over counties or another spatial discretization because by definition, tracts have similar population density.
Tracts are weighted by the number of wells completed within them. The weighted sum over all tracts \(n\) is:
\[\sum_{i=1}^{n}{w_{i} \space p_{i}}\]
where \(w\) is the number of wells in tract \(i\), and \(p\) is the percent difference from CA median household income of that tract. The more wells in a tract, the greater its weight. If this number is very negative, that will suggest that domestic wells are primarily completed in areas with lower median annual income, but it won’t tell us much about the relative difference in areas affected.
Load packages, download data, and do some light cleaning.
Let’s look at all US Census tracts with domestic well completion reports.
Colors correpsond to the difference from median Californian household income ($57,792): red is lower, and blue is higher.
This interactive map has two layers: the first shows only tracts with domestic well completion reports. Colors correpsond to the difference from median Californian household income ($57,792): red is lower, and blue is higher. Centers of high income (blue) tend to cluster in urban and suburban zones. The second layer shows the median household income for each tract. Lighter colors correspond to more income, and align with urban centers.
Finally, we calculate the weighted percent difference from CA median household income, summed over all tracts, which results in a positive number.
\[\sum_{i=1}^{n}{w_{i} \space p_{i}}\]
This means that more domstic wells are drilled in census tracts above the CA median household income than tracts below the median household income. But how big of a difference? We can compute a weighted percent difference sum for tracts above and below, normalize them, and compare.
## [1] 184243.7
That the relative weighted sums of percent difference are within 10% of each other suggests that more domestic wells are drilled in wealthier tracts, and that tracts above and below the California median household income are almost equally affected by the need to drill wells.
What can we learn from exploring this map?
Droughts and water shortage affect both high-SES and low-SES tracts.
However, low-SES tracts will be less financially equipped to deal with groundwater shortages and the costs of completing a new well.
Low-SES tracts tend to be larger in area and geographically isolated from urban centers. When water shortages occur, these tracts are are less likely to be within range of a public water system.
method of determination groups some wells in the center of counties.
Not all census zones have well completion reports.
A quick examination of the map reveals large clusters of monitoring wells in the urban zones of San Deigo, Los Angeles, the Bay Area, Sacramento, and to a lesser extent, the Central Valley.
Injection wells are a relatively new technology, and have not achieved broad use. From a map of counts, it appears that most injection well completion has taken place in Los Angeles and the South Bay, with a few clusters in the northern Sacramento region. Injection wells are both expensive and require extra water to bank, which is why they have probably seen more adoption in northern regions.
Are more wells drilled in the summer months when the weather favors working outside, when the sunlight hours are longer, and when groundwater is in greater demand? In fact, compared to winter months, there’s around a 25-50% increase in drilling during the summer months. Not a surprising result, but reaffirming to see that the data supports this hypothesis.
Is there a relationship between perforated interval thickness and time? Are we screening thicker intervals with time?
Data density as the number of wells per surface area of the subbasin. This metric doesn’t discriminate between high and low quality data–it’s simply a measure of abundance. We define data density (\(\rho_{i}\)) of a subbasin \(i\) as the number of wells completion reports (\(n_{i}\)) per square kilometer of basin (\(A\)). In other words,
\[\rho_{i} = \frac{n_{i}}{A}\]
We see that most subbasins have less than 10 wells per square kilometer. A few small subbasins scattered throughout the state have no well completion reports, and therefore a data density of 0. It seems highly unlikely that there are any subbasins with zero well completion reports, and this may be due to errors in latitude and longitude determination.
In particular, this data can tell us about:
The last three parameters are relatively straightforward and included as columns in the data. Aquifer transmissivity can be predicted from emperical relationships between known aquifer transmissivities and specific capcity. Specific capacity can be calculated from pumping test parameters; specifically it is defined as the ratio of outflow to drawdown, or
\[S_{c} = \frac{Q}{s}\]
Razack and Huntley (1991) studied the relationship between transmissivity (\(T\)) and specific capacity in a large heterogeneous alluvial groundwater basin in Morocco. The study included 215 wells for which \(S_{c}\) and \(T\) were known, and defined an emperical relationship between these parameters, with a correlation coefficient of 0.63. The general form of the equation is given as
\[T = K(S_c)^{0.67}\]
where the constant \(K = 106\) for \(S_{c}\) in units of [\(\frac{gpm}{ft}\)] and \(T\) in units of [\(\frac{m^2}{d}\)].
As we saw in the first figure, not all wells have pump test data. We will estimate \(S_{c}\) and \(T\) for only wells where it is possible, and pumping test data is present.
Transmissivities appear to take a log-normal distribution, which is not uncommon for environmental data. After log transforming the x axis, it appears that transmissivities take on a normal to bimodal shape. This could be due to the fact that the emperical relationship used to compute transmissivities is based on data from an alluvial aquifer, and may not hold for volcanic fractured rock systems, which are abundant in Northern Californian subbasins. We could also be seeing a shallow and deep aquifer in the data.
How are these data spatially distributed? It appears that well completion reports have decent spatial coverage in the Northern central valley, and isolated northern fractured rock subbasins, the Delta, the North Bay basins, the Santa Barabara region, and in the eastern side Tulare Basin.
A future analysis should join these data to a map of aquifer geology and apply an appropriate emperical formula dependent on the the aquifer rock type. This calculation is outside of the scope of this analysis, but it should be noted that Mace (1997) defined an emperical relationship (with a correlation coefficient of 0.891) for determining the transmissivity of a fractured rock aquifer, albeit across a set of karstic aquifers in Texas, as
\[T = 0.76(S_c)^{1.08}\]
where \(T\) and \(S_{c}\) are both in units of [\(\frac{m^2}{d}\)].
Aquifer property data like transmissivity can be of great value to developing groundwater sustainability agencies tasked with paramaterizing groundwater flow and transport models, as well as regulatory bodies assessing the reasonability of such models. It is important to remember that these transmissivities assume alluvial material, and there is unexplained variance in the model used by Razack and Huntley (1991), as well as sample bias in that only one aquifer was sampled. The study by Mace (1997) sampled many aquifers, which partially explains the higher correlation coefficient compared to Mace (1997). Nonetheless, to view big-picture trends, we can we can aggregate points on a subbasin level and plot the median transmissivity per subbain. There are a few outliers with exceptionally high \(T\) that distort the scale, and they are not plotted so that the main trends in transmissivity can be seen.
A next step might be to use ranges of perforated interval thickness over which these tranmissivities were derived to compute upscaled hydraulic conducitivities by the emperical relationship:
\[K = \frac{T}{b}\]
Moreover, it might be useful to create a basin-by-basin report on the available aquifer properties to assist modelers in creating groundwater flow models.
How deep are different well types drilled?
How much of the aquifer is screened? Where it is screened relative to land surface? Production wells are designed to be screened in aquifer materials–the highly conductive sands and gravels that facilitate fluid flow. In contrast, monitoring wells may be screened across the entire casing length. Agricultural wells which have less stringent requirements for water quality may be screened at greater lengths than public supply wells, which are more tightly regulated. Knowing how much of an aquifer is screened gives us an idea of how much an aquifer is being utilized.
Domestic wells are also generally more shallow than public and agricultural wells. This makes them more susceptible to droughts, to top-down pollutants like LNAPLs, DNAPLs, and to non-point source pollutants like salts and nitrates.
Here we calculate mean and median perforated interval thickness, as well as the top and bottom of the perforated interval for all wells in Bulletin 188 subbasins, for domestic and agricultural wells. These data can also be visualized interactively at this blog post.
Lastly, we can use the static water level in domestic, agricultural, and public supply wells to visualize the change from the first recorded measurements. There is more of a negative change in domestic (shallow) wells compared to agricultural and public (deep) wells, which is not surprising. It also appears that in some groundwater basins, static water level has risen. This occurs primarily in small basins with little data to begin with, and may reflect a nuiance in the data caused by a sensitivty to low starting values.
From the United States Geologic Survey:
…droughts in California can be classified in four ways:
- Meteorological drought is a period of one, or more water years, of below-normal precipitation;
- Hydrological drought is a period of one, or more water years, in which there is below-normal availability of surface water and groundwater;
- Agricultural drought is a period of one, or more water years, in which water available for agricultural production is curtailed by 25% or more; and
- Ecological drought is a period of one, or more water years, during which deficits in natural water availability create multiple stressors across ecosystems.
Since 1895, there have been six prolonged dry periods lasting two years or longer, which qualify as droughts under all of the above drought classifications. They are: water years (WY) 1928-34, WY 1976-77, WY 1987-92, WY 2001-02, WY 2007-09 and WY 2012-16.
California Online State Well Completion Reports
Diffenbaugh, Noah S., Daniel L. Swain, and Danielle Touma. 2015. “Anthropogenic warming has increased drought risk in California.” Proceedings of the National Academy of Sciences 112 (13): 3931–6. doi:10.1073/pnas.1422385112.
Mace, Robert. 1997. “Determination of Transmissivity from Specific Capacity Tests in a Karst Aquifer.” Groundwater 35 (5).
Razack, M, and David Huntley. 1991. “Assessing Transmissivity from Specific Capacity in a Large and Heterogeneous Alluvial Aquifer.” Groundwater 29 (6).